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Abstract 

The multi-frontal direct solver is the state-of-the-art algorithm for the direct solution of sparse linear systems. 
This paper provides computational complexity and memory usage estimates for the application of the multi- 
frontal direct solver algorithm on linear systems resulting from B-spline-based isogeometric finite elements, 
where the mesh is a structured grid. Specifically we provide the estimates for systems resulting from C p_1 
polynomial B-spline spaces and compare them to those obtained using C° spaces. 
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1. Introduction 

The main purpose of this paper is to explain in detail how one can derive estimates for computational 
complexity and memory usage of the multifrontal direct solver algorithm as applied to B-spline-based [TJ [5] 
isogeometric finite elements [3J 0] . In this paper we generalize results already published for C° finite element 
spaces for the multi- frontal solver [6] . 

The restriction of this work to structured grid meshes is due to the target application being B-spline-based 
isogeometric finite elements. Isogeometric analysis is a relatively new method that has been the subject of 
much work in recent years. While at its core it is spline-based isoparametric finite element analysis, the 
spaces used possess unique refinement strategies that form spaces that are supersets of conventional finite 
elements. The mesh in B-spline-based isogeometric analysis is always a structured grid of uniform polynomial 
order. This simplification allows for specialized complexity and memory usage estimates to be developed 
for these spaces when the direct solver is the multi-frontal solver. 

The higher continuous basis is known to approximate smooth functions (e.g. solutions to PDEs) with 
orders of magnitude fewer degrees of freedom than their C° counterparts. While this is a promising result, 
the linear systems resulting from inner products of the higher continuous basis functions are also orders of 
magnitude more expensive to solve. This is something we have addressed in a previous paper [7]. In this 
paper we explain the derivation of the estimates in more detail and extend them to all spatial dimensions. 

First we explain the concepts behind the multifrontal direct solver algorithm and highlight why the 
algorithm is well-suited to handle linear systems resulting from traditional C'° finite elements spaces. We 
emphasize the perspective that the algorithm may be viewed as early LU -factorization. Second we evaluate 
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computational complexity of a single level of the multifrontal algorithm. Last we generalize the single level 
result to all levels and provide estimates for C° and C p_1 B-splines spaces. 



2. Multi-frontal direct solver algorithm 

The state of the art direct solver for sparse linear systems is the multi-frontal solver proposed by [H |9] . 
It is the generalization of the frontal solver algorithm proposed by [TO]- Note that while in [10], the original 
idea was developed for finite elements, the generalization applies to general sparse linear systems. 

The key observation in the algorithm is that LC/-factorization may be started during assembly on portions 
of the linear system that are fully assembled on a local level. The general algorithm first examines the matrix 
sparsity pattern to locate fully assembled matrix blocks which are loosely connected to the remaining part 
of the matrix. When used with finite elements, these blocks may be automatically determined using what 
is known about the support of the basis functions. 

In the context of finite elements, the multifrontal direct solver algorithm works in the following manner. 
We create the elemental matrices using standard finite element procedures. In the direct solver terminology, 
these elemental matrices are known as frontal matrices. The element matrices are typically assembled into 
a global matrix, where contributions from shared degrees of freedom with other elements are combined. 
However, depending on the topology and order of the finite elements, there are degrees of freedom which at 
the element level are fully assembled. For efficiency, these fully assembled degrees of freedom are eliminted in 
terms of the partially assembled ones at the element level, that is, at the frontal level. Since this procedure 
can be repeated concurrently at each element, it is usually known as multi-frontal. Particularizing this 
idea to finite elements, we will start with an example of a two element finite element mesh in any spatial 
dimension. 



2.1. Single level example: two element mesh 

Consider the partitioning of the elemental matrices in equation ([!]) . We reorder the elemental matrices 
by first listing those degrees of freedom that are fully assembled on the element level, x e , followed by those 
that are shared with other elements, y e , where subscript e refers to the element number. Thus the element 
matrix can be blocked accordingly to represent interactions between fully assembled degrees of freedom 
and those shared with other elements. Note that A e represents the block of interactions which are fully 
assembled at the element level, blocks B e and C e represent the interactions of fully assembled and shared 
degrees of freedom, and block D e represents interactions of shared degrees of freedom. Particularizing this 
for the two element mesh, we obtain 
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Because the block A e is fully assembled, we may begin the Lt/-factorization early and at the element level. 
Thus for each element, we can multiply the top row by C e A~ x and subtract from the bottom row, 
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Then, for each element e = 1,2, the block matrix D e — C^A^Be and the vector g e — C^A^B^ is assembled. 
After the contributions from both elements are assembled, we can solve for y. Once y is computed, we can 
resort to backward substitution at the element level to compute x e . 



2.2. Multilevel example: eight element mesh in three dimensions 

The procedure can be recursively generalized into multiple levels. For example, the procedure for an 
eight element mesh is shown in figure [l] The elimination proceeds as follows: 

1. Perform the local elimination of fully assembled degrees of freedom in each element as described in 
section [2~Tj Note that the degrees of freedom eliminated, if any, are those which have support only on 
the element, the so-called bubble functions. 



2 



2. Pair the eight elements into any 4 clusters where the pairs of elements share a common face. To these 
frontal matrices, we apply the algorithm again, that is, we eliminate the fully assembled degrees of 
freedom in terms of the remaining degrees of freedom shared with other elements. At this level, the 
degrees of freedom eliminated are those with support on the shared face. 

3. At the next level, we pair the four element clusters again into two which share a common interface. 
The recursive elimination procedure is applied at this level and repeated until we obtain a single cluster 
whose degrees of freedom are fully assembled (the top level in figure [I]). 

The connectivity graph describing the order of elimination and clustering is called the elimination tree. At 
this point, with the solution to the fully assembled system, we can move down the elimintation tree, using 
backward substitution to recover the remaining unknown degrees of freedom (those that were fully assembled 
at each elimination level). 




right front 



right rear 



Figure 1: The four levels of the elimination tree for a cube-shaped mesh with eight finite elements 



3. Computational complexity and memory usage 

Now we look at complexity comparing higher-continuous B-spline spaces to their C° counterparts. For 
simplicity we will only consider spaces which are C p_1 continuous, that is, spaces which have p— 1 continuous 
derivatives across element interfaces, where p refers to the polynomial order. Numerical tests indicate that 
these spaces form limiting cases in the performance of the direct solver algorithm. 

The examples in the previous section come from C° finite element spaces. In these examples, (p — l) d 
degrees of freedom may be eliminated at the first level, where d refers to the spatial dimension. For C p_1 
spaces, the supports of the basis functions spread into multiple elements and thus no degrees of freedom 
may be eliminated at the element level. To be able to eliminate a degree of freedom, we will cluster p + 1 
elements in each spatial dimension together and use this on the first level for C v ~ x spaces. This is the 
smallest grouping that can be obtained where a degree of freedom is fully assembled. 

In the remaining portion of this section we will develop the estimates for complexity and memory usage. 
Computational complexity will be estimated by counting floating point operations (FLOPS) and the memory 
usage will we counted as the bytes needed to store the LU factorization at each level of the elimination tree. 
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The main building block of the algorithm is the early L[/-factorization, also known as the Schur complement 
or static condensation. We will first develop the cost of the Schur complement, and then proceed to apply 
this to the full algorithm. 

3.1. Cost of the Schur complement 

In order to estimate the FLOPS and memory required to perform the above partial LU factorization, 
we will count operations and memory used in forming the Schur complement, as shown in equation ([2]). We 
denote the dimension of the square matrix A by q. We denote the number of columns in B and rows in C 
as r, where r is an assumed a constant. Then, we have: 

FLOPS = 0(q 3 + q 2 r + qr 2 ) = 0{q 3 + qr 2 ) 

Memory = 0{q 2 + qr) ^ ' 

The FLOPS estimate is obtained by counting the operations needed to find the LU factors of A, 0(q 3 ). To 
this we add the FLOPS required to perform r back-substitutions to form A~ 1 B, 0{rq 2 ). Finally, we add 
the cost of matrix multiplication of C to A~ 1 B, 0(qr 2 ). The memory estimate is obtained by adding the 
memory needed to store the LU factors of the matrix A, 0(q 2 ), to that required to store A~ X B and CA^ 1 , 
0{rq). 

In the above memory estimate, we are only concerned with the space required to store L and U, since it 
is well-known that the cost of storing original matrix is always smaller or equal than the memory required 
to store factors L and U. In particular, we have not included the memory required to store the Schur 
complement, since this is replaced in the next steps of LU factorization by additional Schur complement 
operations. 

3.2. Cost of the multi-frontal solver 

We divide our computational domain in N c clusters of elements. For the C° case, each cluster is simply 
an element, while for C p_1 , each cluster is a set of p + 1 consecutive elements in each spatial dimension. 
We assume for simplicity that the number of clusters in our computational domain is (2 d ) s , where s is a 
positive integer which represents the number of levels of the multi-frontal algorithm. Notice that even if this 
assumption is not verified, the final result still holds true provided that the number of degrees of freedom is 
sufficiently large. 

The multi-frontal direct solver algorithm is summarized in algorithm [I] The FLOPS and memory 
required by algorithm [l] can be expressed as 

s-l 

X> c (i)S(i) (4) 

where S(i) is the cost (either FLOPS or memory) of performing each Schur complement at the i th level. 
Using the notation of the previous subsection on the Schur complement, we define q = q{i) as the number 
of interior unknowns of each cluster at the i th step, and r = r{i) as the number of interacting unknowns at 
the i th step. We construct estimates for these numbers and summarize them in table [l] 

Table 1: Number of interior (q) and interacting (r) unknowns at each level i of the multi- frontal solver. 
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Let N be the total number of unknowns in the original system. We use the results from table [T] with the 
FLOPS and memory estimates in equation ([3| to develop table [2] This table describes the cost in FLOPS 
and memory of each level of the multi-frontal algorithm. 

Finally we use equation ^ and table [2] to specialize estimates for C° and C p_1 B-splines in one to three 
spatial dimensions. 
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Algorithm 1 Multi-Frontal Algorithm 



for i = to s — 1 do 

n c = n c (i) = (2 d y-> 

if i — then 

Define N c (0) clusters 
else 

Join the old N c (i — 1) clusters 
Eliminate interior degrees of freedom 
Define N c (i) new clusters 
end if 
end for 



Table 2: FLOPS and memory estimates at each level i of the multi-frontal solver. 
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Estimates for ID C° B-splines. 
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Estimates for ID C v ~ x B-splines. 
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Estimates for 2D C° B-splines. 
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Estimates for 3D C° B-splines. 
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Estimates for 3D C p 1 B-splines. 

s-1 

FLOPS = 2 3s p 6 + 2 3(s " i) 2 6 V = C(2 3s p 6 + 2 6s p 9 ) = 
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4. Numerical Results 

To test the validity of these estimates, we compute solutions to the Laplace equation in three spatial 
dimensions on the unit cube 

-V • (Vu) = on n 
u = on T Da 

u = 1 on Tfli 

(Vu)-n = onrjy 

where il = [0, l] 3 , T D0 - (:, :, 0), F m = (:, 1), and T N = (0, :, :) U (1, :, :) U (:, 0, :) U (:, 1, 0). 

All computational experiments have been performed on a workstation with two quad-core Xeon X5550 
processors and 24 Gb of memory running Fedora 11. The model problem was implemented using PETSc 
[HJ [12] data structures. We used the MUMPS [HI [14] implementation of the multi-frontal algorithm, with 
METIS [H] ordering (nested dissection). We interfaced to MUMPS through PETSc, with the option to 
solve an asymmetric system. Note that only one core was used in these numerical experiments. 

For all numerical results, we relate the FLOPS estimates to the computational time measured for the 
solution of the linear system. The memory estimates we relate to the number of nonzero entries in the LU 
factors. We report the memory required to store an integer and a double precision number for each nonzero 
entry. Note that this is a conservative quantification of memory usage. In general, solvers will require the 
use of additional memory. However, we chose to report the memory required to store the LU factors as it 
is most closely related to the estimates derived in the previous section. 

We tested the estimates by solving the model problem for a linear system containing 100.000 degrees of 
freedom for both C° and C p_1 spaces. This was accomplished by using different numbers of elements. Note 
that while this will affect assembly time, this is not included here in the time reported. 

The numeric results of this test are shown in figure [2] We show computational time and memory usage 
for the C° spaces. The solid line shown in a statistical best fit of the data to the estimate. The excellent 
agreement between the estimates and the real data supports the accuracy of the estimates. 

5. Conclusions 

In this paper, we derive estimates for the computational complexity and memory usage of the multifrontal 
direct solver algorithm applied to finite elements, where the mesh is a structured grid. This restriction is 
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Figure 2: Time and memory for 100,000 degrees of freedom systems along with estimates 



with a view to forming estimates for higher continuous spaces, C p ~ x B-spline spaces. We present numerical 
results to support the validity of these estimates. 

We observe that the estimates for ID, 2D, and 3D are essentially different. This is because of the 
structure of what can be eliminated at each level. Although in ID the number of FLOPS is linear with 
respect the number of unknowns N, this case is not interesting. We note that for C° spaces in two and 
three spatial dimensions, the number of FLOPS is independent of p if the number of unknowns N is large 
enough. Therefore, under the assumption that N is large enough, an adaptive algorithm should select 
refinements based exclusively on the maximum decrease of error per added unknown, independently of 
wether the refinement takes place on h or p, where h here refers to the support of the basis function. This 
efficiency is exploited in the adaptive algorithms used in ftp-finite elements [16 . 

For 2D and 3D, the number of FLOPS of the C p_1 spaces is p 3 times more expensive than the C° 
spaces, provided N is large enough. Thus, an adaptive algorithm for isogeometric analysis should not be 
based exclusively on the maximum decrease of the error per added unknown. It would need to incorporate 
a special treatment of the cost of each added unknown depending upon the type of refinement. 
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